Introduction. The rapid development of systems such as Yandex and Google has made the task of searching for substrings in a string highly relevant, and approaches to its solution are being actively investigated. This task arises in building database management systems that support associative search. It is also applicable to information security problems and to the creation of antivirus programs, where substring search algorithms are used for signature-based detection.
Materials and Methods. The solution to the problem is based on the Aho-Corasick algorithm, a classical technique for searching for substrings in a string. At the same time, a new approach to the preprocessing stage is employed.

Research Results. The possibility of constructing the transition function and suffix links through suffix arrays and special mappings is shown. The relationship between the prefix tree and suffix arrays was investigated, which led to a fundamentally new method of constructing the transition and failure functions. The results obtained make it possible to substantially reduce the time spent on preprocessing a set of pattern strings when an integer alphabet is used. The paper presents eight algorithms and evaluates them. The results obtained are compared with previously known ones. Two theorems and eight lemmas are proved. Two examples illustrating the practical application of the developed preprocessing procedure are given.

Discussion and Conclusions. The preprocessing procedure proposed in this paper is based on the connection between the suffix array built from a set of pattern strings and the construction of the transition and failure functions at the initial stages of the Aho-Corasick algorithm. This approach differs from the traditional one and requires algorithms that build a suffix array in linear time. Thus, the paper describes algorithms that, for a certain type of alphabet, significantly reduce the time for preprocessing a set of pattern strings in comparison with the known approach proposed by A. Aho and M. Corasick. The research results presented in the paper can be used in antivirus programs that search for signatures of malicious data objects in the memory of a computer system. In addition, this approach to the substring search problem can significantly speed up database management systems that use associative search.

* The research is done within the frame of the independent R&D.
* E-mail: mazurencoal@gmail.com, boldyrikhin@mail.ru


Keywords: string searching, Aho-Corasick algorithm, prefix 
tree, suffix array, information search, error function, transition 
function 


For citation: Mazurenko A. V., Boldyrikhin N. V. Accelerated preprocessing in the task of searching substrings in a string. Vestnik of Don State Technical University, 2019, vol. 19, no. 3, pp. 290-300. https://doi.org/10.23947/1992-5980-2019-19-3-290-300


Introduction. Nowadays, awareness of the cybersecurity of distributed information systems and individual computing facilities is growing substantially [1]. The range of such tasks is quite wide [1-10]. Of special interest is the creation of powerful antivirus software (SW). One of the most important tasks solved by such SW is searching for substrings in a string [1, 5, 6, 10-13].

Materials and Methods. The task of substring searching is to find all substrings of a text $T$ of length $m$ that match some pattern from a given set of patterns $P$. Suppose that the sum of the lengths of all elements of $P$, which consist of characters of the alphabet $I$, is $n$. A solution to this problem was proposed by A. Aho and M. Corasick [6, 10]. In their algorithm, the preprocessing time is $O(n|I|)$, and the search time is $O(m|I| + k)$. Here, $k$ is the number of matches found in the text with strings belonging to the set of patterns.
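To make the problem statement concrete, the following is a minimal brute-force baseline in Python. It is only an illustration of the expected output (every occurrence of every pattern), not the Aho-Corasick algorithm and not the method of this paper; the function name `find_all_matches` is an assumption introduced for the example.

```python
def find_all_matches(text, patterns):
    """Return sorted (start_index, pattern) pairs for every occurrence
    of any pattern in text. Brute force: each pattern is scanned separately."""
    matches = []
    for pattern in patterns:
        start = text.find(pattern)
        while start != -1:
            matches.append((start, pattern))
            # Restart one position later so overlapping occurrences are kept.
            start = text.find(pattern, start + 1)
    return sorted(matches)


if __name__ == "__main__":
    print(find_all_matches("oncell", ["on", "once", "cell"]))
```

The number of returned pairs plays the role of $k$ in the search-time estimate above.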

Currently, the task of finding a substring in a string is being intensely investigated for two reasons: 

- search engines are rapidly developing [11]; 

- the detection process in antivirus software products is based on signatures [1]. 

In this regard, a number of algorithms have been created, and the choice among them depends on the specific needs of the user. The latest results on the problem of searching for a set of substrings are described in [13].

The results presented in this paper are based on the relationship between the suffix array created from a set of pattern strings and the construction of the transition and failure functions at the initial stages of the Aho-Corasick algorithm. This approach differs from the traditional one and requires algorithms that construct a suffix array in linear time.


So, the paper describes algorithms by which the preprocessing time is reduced to $O(n)$.


Given the alphabet $I$ and a set of patterns $P = \{P_1, P_2, \dots, P_k\}$, where $P_i \in I^*$, $i = 1, \dots, k$. Let us denote $n = \sum_{i=1}^{k} |P_i|$.

Assume that the alphabet $I$ is a limited range of integers. The boundary may depend on the length of the string in question $s \in I^*$ or may involve an interval $[0, c]$, where $c$ is a positive integer: $c = |s|$. Let $\varepsilon \in I^*$ be an empty string.


Let goto be the transition function and failure be the failure (error) function. The modifications concern the methods for constructing these functions in the Aho-Corasick algorithm [6, 10].


Suppose SuffArr($s$) is some algorithm that constructs a suffix array for a string $s \in I^*$ in linear time. A description of such algorithms can be found, for example, in [12-15].


Suppose $x, y \in I^*$. Then $lcp(x, y)$ is the largest common prefix of the strings $x$ and $y$.
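The two primitives just introduced can be sketched in Python as follows. This is a hedged illustration: `naive_suffix_array` is a quadratic stand-in for SuffArr (it is not one of the linear-time constructions from [12-15]), and both function names are assumptions made for the example.

```python
def naive_suffix_array(s):
    """Return the start positions of the suffixes of s in lexicographic order.
    Sorting whole suffixes is far from linear time; it only mimics SuffArr's output."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp(x, y):
    """Largest common prefix of the strings x and y."""
    n = 0
    while n < len(x) and n < len(y) and x[n] == y[n]:
        n += 1
    return x[:n]


if __name__ == "__main__":
    print(naive_suffix_array("banana"))
    print(lcp("once", "one"))
```

Any correct suffix array construction can replace `naive_suffix_array` without changing the rest of the procedure; only the running time differs.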


Information technology, computer science, and management 



http://vestnik.donstu.ru 



Vestnik of Don State Technical University. 2019. Vol. 19, no. 3, pp. 290-300. ISSN 1992-5980 eISSN 1992-6006 





Consider the string $s \in I^*$, where $s = s[s[0]s[1] \dots s[n-1]]$. Let $s[s[i]s[i+1] \dots s[j]]$ be the substring of $s$ including characters from $i$ to $j$, where $i < j$; $i, j = 0, \dots, n-1$. Let us denote by $p_s$ the suffix array corresponding to the string $s$. Suppose

$p_s = p_s[p_s[0]\, p_s[1] \dots p_s[n-1]]$,

that is, $s[s[p_s[0]] \dots s[n-1]] < s[s[p_s[1]] \dots s[n-1]] < \dots < s[s[p_s[n-1]] \dots s[n-1]]$.

To construct a suffix array, the algorithm described in [15] will be used. 

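Suppose $\alpha_i \notin I$, $\alpha_i \ne \alpha_j$, $1 \le i < j \le k+1$, $\alpha_1 < \alpha_2 < \dots < \alpha_{k+1}$. Let $\forall b \in I$: $\alpha_i < b$, where $1 \le i \le k+1$. Granting $P \ne \varnothing$, $alpha = \{\alpha_1, \alpha_2, \dots, \alpha_k, \alpha_{k+1}\}$.

Suffix array processing algorithm for $p_s$

Here, $s = \alpha_1 P_1 \alpha_2 P_2 \dots \alpha_k P_k \alpha_{k+1}$, $P_i \in I^*$, $1 \le i \le k$.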

Adaptation(s, p_s, alpha)

1. new_array ← ∅
2. for (i ← |alpha|; i < |s|; i++) {
3.     j ← 0
4.     while (s[s[p_s[i]] … s[|s| − 1]][j] ∉ alpha) {
5.         new_array[i][j] ← s[s[p_s[i]] … s[|s| − 1]][j]
6.         j ← j + 1
7.     }
8. }
9. ordered_list[0] ← new_array[0]
10. for (i ← 1; i < |s| − |alpha|; i++) {
11.     j ← 0
12.     if (new_array[i] ≠ new_array[i − 1]) {
13.         ordered_list[i] ← new_array[i]
14.         j ← j + 1
15.     }
16. }
17. return ordered_list


Lemma 1. Let $P = \{P_1, P_2, \dots, P_k\}$, $s = \alpha_1 P_1 \alpha_2 P_2 \dots \alpha_k P_k \alpha_{k+1}$. Then the Adaptation algorithm builds an array of lexicographically ordered suffixes of the patterns belonging to $P$ over the time $O(|s| - |alpha|)$.

Proof. In the loop of 2-8, the new_array is constructed, whose i-th element is a prefix of the corresponding suffix of $s$: it includes all the characters of this suffix from position zero up to, but not including, its first character belonging to the set alpha. Using the suffix array $p_s$, all suffixes of $s$ are traversed in their lexicographic order. Thus, the new_array consists of all suffixes of the patterns belonging to $P$ in lexicographic order, and some suffixes may be repeated.

Note that all strings starting with characters belonging to the alpha array, that is, the first $|alpha|$ suffixes, are excluded from consideration. Then, in the loop of 10-16, the ordered_list array is constructed from the new_array by eliminating repeated strings. Because the strings are in lexicographic order, it is sufficient to check whether the string in question coincides with the previous one.


The loop of 2-8 is executed over the time $O(|s| - |alpha|)$ since all strings starting with characters belonging to the alpha array are excluded from consideration. In the loop of 10-16, $|s| - |alpha|$ string comparisons occur. Thus, we obtain an asymptotic estimate $O(|s| - |alpha|)$ of the algorithm running time. The lemma is proved.
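The behaviour described in Lemma 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the separators $\alpha_1, \dots, \alpha_{k+1}$ are encoded as the integers $0, \dots, k$, pattern characters are shifted above them, the suffix array is built naively rather than in linear time, and the output is compacted while scanning. The names `build_s`, `adaptation`, `decode`, and `naive_suffix_array` are hypothetical.

```python
def naive_suffix_array(s):
    # Quadratic stand-in for a linear-time suffix array construction.
    return sorted(range(len(s)), key=lambda i: s[i:])


def build_s(patterns):
    """Encode s = a_1 P_1 a_2 P_2 ... a_k P_k a_{k+1} as a list of integers.
    Separators are 0..k; every pattern character is mapped above all separators."""
    k = len(patterns)
    s = []
    for idx, pattern in enumerate(patterns):
        s.append(idx)                                  # separator a_{idx+1}
        s.extend(ord(ch) + k + 1 for ch in pattern)    # shifted characters
    s.append(k)                                        # closing separator a_{k+1}
    return s, k + 1


def adaptation(s, p_s, num_sep):
    """Collect the distinct pattern suffixes in lexicographic order."""
    ordered_list = []
    # The first num_sep entries of p_s are the suffixes that start with a separator.
    for start in p_s[num_sep:]:
        end = start
        while s[end] >= num_sep:                       # stop at the first separator
            end += 1
        word = s[start:end]
        if not ordered_list or ordered_list[-1] != word:   # drop adjacent repeats
            ordered_list.append(word)
    return ordered_list


def decode(word, num_sep):
    return "".join(chr(code - num_sep) for code in word)


if __name__ == "__main__":
    s, num_sep = build_s(["on", "con"])
    words = adaptation(s, naive_suffix_array(s), num_sep)
    print([decode(w, num_sep) for w in words])
```

For the two patterns on and con, the suffix n and the suffix on each occur twice in $s$, and the duplicate check against the previous entry removes the repeats.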


Partitioning algorithm according to lexicographic order
Here, s is an array of lexicographically ordered strings.
DandC(s)

1. sub[0] ← 0
2. j ← 1
3. for (i ← 0; i < |s| − 1; i++)
4.     if (s[i][0] ≠ s[i + 1][0]) {
5.         sub[j] ← i + 1
6.         j ← j + 1
7.     }
8. sub[j] ← |s|
9. return sub


Lemma 2. Given an array $s$ of lexicographically ordered strings, the DandC algorithm constructs over the time $O(|s|)$ an array sub of non-negative integers: the indices of the first string in each group of strings that share the same first character.

Proof. The boundary corresponding to the first character begins with 0, which corresponds to the assignment performed in step 1. In the loop of 3-7, the first characters of the i-th and (i + 1)-th strings are sequentially compared, where $i = 0, \dots, |s| - 2$. If the characters are not equal, then the beginning of the boundary corresponding to the next character is written to the sub array. Otherwise, the loop execution continues. The right boundary of the last character corresponds to the number of strings in the $s$ array (step 8).

The comparison in step 4 takes $O(1)$ time, as do the assignment in step 5 and the increment in step 6. Thus, the loop of 3-7 is performed over the time $O(|s|)$. The lemma is proved.
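The partitioning step can be sketched in Python as follows. This is a minimal illustration that assumes a non-empty list of non-empty, lexicographically ordered strings; the function name `dand_c` is introduced for the example.

```python
def dand_c(words):
    """Return the group boundaries of a sorted list of non-empty strings.
    sub[g] is the index of the first string of group g (same first character);
    the last entry is len(words), the right boundary of the final group."""
    sub = [0]
    for i in range(len(words) - 1):
        if words[i][0] != words[i + 1][0]:   # first characters differ: new group
            sub.append(i + 1)
    sub.append(len(words))
    return sub


if __name__ == "__main__":
    print(dand_c(["cell", "eye", "lull", "near", "on", "once", "one"]))
```

Each string is touched once and only its first character is inspected, which matches the $O(|s|)$ estimate of Lemma 2.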

First link algorithm

Here, tree is a tree, lex_words $\in I^*$, link_num is the number of some character in the lex_words string, and v is the serial number of a new node that joins the node with the serial number node_number.

BuildFirstLink(tree&, lex_words&, v&, link_num, node_number)

1. new tree.node[v]
2. tree.node[v].state ← lex_words[lex_words[0] … lex_words[link_num]]
3. new tree.node[node_number].link ← tree.node[v]
4. tree.node[node_number].link.symbol ← lex_words[link_num]
5. v ← v + 1

Lemma 3. The BuildFirstLink algorithm constructs in the tree a new node with the serial number v and an arc leading from node_number to the new node v over the time $O(1)$.

Substring link algorithm

Here, tree is a tree, lex_words $\in I^*$, and v is the serial number of a new node that joins the node with the serial number start.

BuildSubstringLink(tree&, lex_words&, v&, start)

1. for (k ← start; k < |lex_words|; k++) {
2.     new tree.node[v]
3.     tree.node[v].state ← lex_words[lex_words[0] … lex_words[k]]
4.     new tree.node[v − 1].link ← tree.node[v]
5.     tree.node[v − 1].link.symbol ← lex_words[k]
6.     v ← v + 1
7. }
Lemma 4. The BuildSubstringLink algorithm constructs new nodes in tree matching all prefixes of the string lex_words, starting with the prefix lex_words[lex_words[0] … lex_words[start]], over the time $O(|lex\_words| - start)$.

Last link algorithm

Here, tree is a tree, lex_words $\in I^*$, v is the serial number of a new arc, and $I$ is an alphabet.

BuildLastLink(tree&, lex_words, v, I)

1. new tree.node[0].link[v] ← tree.node[0]
2. symbols ← ∅
3. for (i ← 0; i < |lex_words|; i++)
4.     symbols[i] ← lex_words[i][0]
5. j ← 0
6. for (i ← 0; i < |I|; i++) {
7.     if (I[i] ∉ symbols) {
8.         tree.node[0].link[v].symbol[j] ← I[i]
9.         j ← j + 1
10.     }
11. }

Lemma 5. The BuildLastLink algorithm builds a loop at the root node over the time $O(|lex\_words| + |I|)$. The loop is labelled with the set of symbols by which it is impossible to go from the root node to other nodes of the tree.


Transition algorithm
Here, lex_words is an array of lexicographically ordered strings.

CreateLink(lex_words)

1. str ← ∅
2. sub ← DandC(lex_words)
3. v ← 1
4. tree ← ∅
5. tree.node[0].state ← ε
6. for (i ← 0; i < |sub| − 1; i++) {
7.     BuildFirstLink(tree, lex_words[sub[i]], v, 0, 0)
8.     BuildSubstringLink(tree, lex_words[sub[i]], v, 1)
9.     for (j ← sub[i] + 1; j < sub[i + 1]; j++) {
10.         temp ← |lcp(lex_words[j − 1], lex_words[j])| + 1
11.         z ← tree.getStateNumber(lcp(lex_words[j − 1], lex_words[j]))
12.         BuildFirstLink(tree, lex_words[j], v, temp, z)
13.         BuildSubstringLink(tree, lex_words[j], v, temp)
14.     }
15. }
16. BuildLastLink(tree, lex_words, v, lex_words)
17. return tree
Lemma 6. The CreateLink algorithm builds a prefix tree with a loop at the root node over the time $O\left(\sum_{i=0}^{|lex\_words|-1} |lex\_words[i]|\right)$.
Proof. In step 2, the DandC algorithm is executed (see Lemma 2), after which, in step 5, the root node of the tree with the serial number 0 is created. Its state is taken equal to the empty string $\varepsilon$. Consider the loop of 6-15 at the i-th step.

In step 7, using the BuildFirstLink algorithm, a node is created whose state corresponds to the first character of the string lex_words[sub[i]]. Given the construction of the sub array, it can be argued that such a character has not occurred before among the first characters of the previous strings. Then, in step 8, the BuildSubstringLink algorithm sequentially creates nodes whose states match all prefixes of the string lex_words[sub[i]], excluding the prefix built in the previous step.

In the loop of 9-14, using the BuildFirstLink and BuildSubstringLink algorithms, we perform the same actions with the strings whose indices lie in the integer interval [sub[i] + 1, sub[i + 1] − 1]. Since each such string has a common non-empty prefix with the previous string, the algorithm immediately switches to the state corresponding to the largest common prefix, starting from which new nodes have to be built. In step 16, using the BuildLastLink algorithm, a loop at the root node is created.

Steps 12 and 13 are performed over the time

$O(1) + O\left(|lex\_words[j]| - |lcp(lex\_words[j-1], lex\_words[j])| - 1\right) = O\left(|lex\_words[j]| - |lcp(lex\_words[j-1], lex\_words[j])|\right)$.

Thus, it follows from Lemmas 2, 3, and 4 that the loop of 9-14 is executed over the time

$O\left(\sum_{j=sub[i]+1}^{sub[i+1]-1} \left(|lex\_words[j]| - |lcp(lex\_words[j-1], lex\_words[j])|\right)\right)$.

The loop of 6-15 is executed over the time

$O\left(\sum_{i=0}^{|sub|-2} \left(|lex\_words[sub[i]]| + \sum_{j=sub[i]+1}^{sub[i+1]-1} \left(|lex\_words[j]| - |lcp(lex\_words[j-1], lex\_words[j])|\right)\right)\right)$,

which does not exceed $O\left(\sum_{i=0}^{|lex\_words|-1} |lex\_words[i]|\right)$ because every subtracted lcp length is non-negative.

It follows from Lemma 5 that step 16 is performed over the time $O\left(|lex\_words| + \sum_{i=0}^{|lex\_words|-1} |lex\_words[i]|\right)$. Summing the cost $O(|lex\_words|)$ of step 2, the cost of the loop of 6-15, and the cost of step 16, we obtain an asymptotic estimate of the running time of the algorithm:

$O\left(\sum_{i=0}^{|lex\_words|-1} |lex\_words[i]|\right)$.

The lemma is proved.
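The idea behind CreateLink, namely that each sorted word only needs new nodes beyond its largest common prefix with the previous word, can be sketched in Python as follows. This is a simplified illustration, not the authors' procedure: it uses a dictionary from state strings to node numbers in place of tree.getStateNumber, it slices strings (so it is not linear time), and it models the root loop inside a separate `goto` helper. All names are assumptions introduced for the example.

```python
def lcp_len(x, y):
    """Length of the largest common prefix of x and y."""
    n = 0
    while n < len(x) and n < len(y) and x[n] == y[n]:
        n += 1
    return n


def create_link(lex_words):
    """Build a prefix tree from sorted, distinct, non-empty strings.
    Returns (states, edges): states[v] is the string spelled by node v,
    edges[v] maps a symbol to the child node number."""
    states = [""]            # node 0 is the root with the empty state
    edges = [{}]
    number_of = {"": 0}      # state string -> node number
    prev = ""
    for word in lex_words:
        common = lcp_len(prev, word)
        node = number_of[word[:common]]      # resume at the shared prefix
        for k in range(common, len(word)):
            v = len(states)                  # next free serial number
            states.append(word[:k + 1])
            edges.append({})
            number_of[word[:k + 1]] = v
            edges[node][word[k]] = v
            node = v
        prev = word
    return states, edges


def goto(edges, node, symbol):
    """Transition function: the root loops on symbols without an outgoing arc;
    other nodes return None when no arc exists."""
    if symbol in edges[node]:
        return edges[node][symbol]
    return 0 if node == 0 else None


if __name__ == "__main__":
    patterns = ["one", "on", "once", "cell", "lull", "eye", "near"]
    states, edges = create_link(sorted(patterns))
    for v, state in enumerate(states):
        print(v, repr(state), edges[v])
```

Because the words arrive in lexicographic order, the shared prefix of a word and its predecessor is always an existing node, so no search from the root is needed.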
The goto function algorithm

Here, P is a set of pattern strings.
ConstructGoto(P)
1. s ← α_1 P_1 α_2 P_2 … α_k P_k α_{k+1}
2. p_s ← SuffArr(s)
3. alpha ← {α_1, α_2, …, α_k, α_{k+1}}
4. ordered_list ← Adaptation(s, p_s, alpha)
5. j ← 0
6. lex_words ← ∅
7. P_length ← {|P_1|, |P_2|, …, |P_k|}
8. for (i ← 0; i < |ordered_list|; i++)
9.     if ((|ordered_list[i]| ∈ P_length) and (ordered_list[i] ∈ P))
10.         lex_words[j++] ← ordered_list[i]
11. goto ← CreateLink(lex_words)
12. return goto
Recall that $n = \sum_{i=1}^{k} |P_i|$ for a set of patterns $P = \{P_1, P_2, \dots, P_k\}$.


Theorem 1. The ConstructGoto algorithm constructs the goto function over the time $O(n)$.


Proof. In step 2, a suffix array $p_s$ for the string $s$ is constructed. In step 4, using the Adaptation algorithm, all suffixes of the strings belonging to the set of patterns $P$ are written to the ordered_list array; recurrences are not excluded. In the loop of 8-10, an array lex_words is constructed that contains the suffixes belonging to $P$, arranged in lexicographic order without recurrences. In step 11, a prefix tree with a loop at the root node is built from the strings contained in the lex_words array. The data structure returned by the CreateLink algorithm defines exactly the goto function.

Step 2 is completed over the time $O(n + k + 1)$ [12]. From Lemma 1, it follows that step 4 is completed over the time $O(n + k + 1 - k - 1) = O(n)$. In the loop of 8-10, only strings whose length is equal to the length of some pattern are considered. Thus, no more than $O(n)$ checks are needed to find the patterns of $P$. From Lemma 6, it follows that step 11 is completed over the time $O\left(\sum_{i=0}^{|lex\_words|-1} |lex\_words[i]|\right) = O(n)$. Since $k \le n$, we obtain an asymptotic estimate of the running time of the algorithm: $O(n) + O(n + k + 1) = O(n)$. The theorem is proved.


Research Results

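Example 1.
Suppose $P = \{one, on, once, cell, lull, eye, near\}$. Then

$s = \alpha_1\, one\, \alpha_2\, on\, \alpha_3\, once\, \alpha_4\, cell\, \alpha_5\, lull\, \alpha_6\, eye\, \alpha_7\, near\, \alpha_8$. (1)

Table 1 shows the result of the goto function algorithm for the input string $s$ (1).

Table 1
Prefix tree structure

| Node number | Node state | States reached by links from the node | Symbols on the links from the node |
|---|---|---|---|
| 0 | ε | 1. c; 2. e; 3. l; 4. n; 5. o | 1. c; 2. e; 3. l; 4. n; 5. o |
| 1 | c | 1. ce | 1. e |
| 2 | ce | 1. cel | 1. l |
| 3 | cel | 1. cell | 1. l |
| 4 | cell | - | - |
| 5 | e | 1. ey | 1. y |
| 6 | ey | 1. eye | 1. e |
| 7 | eye | - | - |
| 8 | l | 1. lu | 1. u |
| 9 | lu | 1. lul | 1. l |
| 10 | lul | 1. lull | 1. l |
| 11 | lull | - | - |
| 12 | n | 1. ne | 1. e |
| 13 | ne | 1. nea | 1. a |
| 14 | nea | 1. near | 1. r |
| 15 | near | - | - |
| 16 | o | 1. on | 1. n |
| 17 | on | 1. onc; 2. one | 1. c; 2. e |
| 18 | onc | 1. once | 1. e |
| 19 | once | - | - |
| 20 | one | - | - |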
Suppose $\tilde{s} = \alpha_{k+1} \tilde{P}_k \alpha_k \dots \alpha_3 \tilde{P}_2 \alpha_2 \tilde{P}_1 \alpha_1$ is the mirroring of the string $s$.
Failure function algorithm
Here, P is a set of pattern strings.


FalseSuff(P)

1. s̃ ← α_{k+1} P̃_k α_k … α_3 P̃_2 α_2 P̃_1 α_1
2. p_s̃ ← SuffArr(s̃)
3. alpha ← {α_1, α_2, …, α_k, α_{k+1}}
4. ordered_list ← Adaptation(s̃, p_s̃, alpha)
5. link ← ∅
6. for (i ← 0; i < |s̃|; i++)
7.     inLink[i] ← ε
8. sub ← DandC(ordered_list)
9. str ← ∅
10. for (i ← 0; i < |sub| − 1; i++)
11.     for (j ← sub[i]; j < sub[i + 1] − 1; j++)
12.         str[j] ← |lcp(ordered_list[j], ordered_list[j + 1])|
13. for (i ← 0; i < |sub| − 1; i++)
14.     for (k ← sub[i + 1] − 1; k > sub[i]; k−−) {
15.         for (j ← sub[i]; j < k; j++) {
16.             min_element ← min(str[k − 1], str[k − 2], …, str[j])
17.             if (min_element = |ordered_list[j]|)
18.                 min_temp[j − sub[i]] ← min_element
19.         }
20.         max_element ← max(min_temp[0], min_temp[1], …, min_temp[w])
21.         find max_index: min_temp[max_index] = max_element
22.         inLink[k] ← ordered_list[max_index + sub[i]]
23.     }
24. for (i ← 0; i < |inLink|; i++) {
25.     link[i][0] ← ordered_list[i]   // string mirroring
26.     link[i][1] ← inLink[i]   // string mirroring
27. }
28. return link

Remark. In step 20, w < sub[i + 1] − sub[i] − 1.

Theorem 2. The FalseSuff algorithm constructs the failure function over the time $O(n)$.

Proof. In step 1, we construct an array of characters that contains mirror images of the strings belonging to the set of patterns $P$ and some unique characters. In step 2, we construct a suffix array $p_{\tilde{s}}$ for the string $\tilde{s}$. In step 4, using the Adaptation algorithm, all suffixes of the strings belonging to the set of patterns $\tilde{P}$ (the set consisting of the mirrored strings of $P$) are written to the ordered_list array, and recurrences are not excluded.

In step 8, the DandC algorithm is executed (see Lemma 2), after which, in the loop of 10-12, we find the length of the largest common prefix between adjacent strings that share the same first character and write the result to the str array. Note that this value is zero for strings that do not satisfy this condition. In the loop of 13-23, a special mapping is constructed between the strings that share the first character. We describe this mapping. Take some string

$s \in$ ordered_list[ordered_list[sub[i]], …, ordered_list[sub[i + 1] − 1]].

Consider the set of strings belonging to ordered_list[ordered_list[sub[i]], …, ordered_list[sub[i + 1] − 1]] whose length is equal to the length of their largest common prefix with $s$ (that is, the strings that are proper prefixes of $s$), excluding $s$ itself. From this set, we find the string $s'$ of the greatest length and assign it to $s$. Obviously, the constructed mapping is a bijection under the condition $s' \ne \varepsilon$. The result is written to the inLink array. In the loop of 24-27, using the inLink array, we explicitly indicate the constructed mapping while mirroring each of the strings. Thus, we assign the node $\tilde{s}'$ to the node $\tilde{s}$ of the prefix tree constructed on the basis of the set of patterns $P$. Its state is equal to the largest proper suffix of $\tilde{s}$ that occurs among the states of the considered prefix tree. According to the definition of the failure function, this is the desired result.


Suppose $n = \sum_{i=1}^{k} |P_i|$. Step 2 is completed over the time $O(n + k + 1)$ [12]. From Lemma 1, it follows that step 4 is performed over the time $O(n + k + 1 - k - 1) = O(n)$. The loop of 6-7 is executed over the time $O(|\tilde{s}|) = O(n)$. From Lemma 2, it follows that step 8 is performed over the time $O(n)$. The loop of 10-12 is executed over the time

$O\left(\sum_{i=0}^{|sub|-2} \sum_{j=sub[i]}^{sub[i+1]-2} \gamma_j\right) = O(k)$, $\forall j\ \gamma_j = 1$.

The loop of 14-23 is completed over the time $O(sub[i+1] - sub[i] - 1)$. Then the loop of 13-23 is executed over the time

$O\left(\sum_{i=0}^{|sub|-2} (sub[i+1] - sub[i] - 1)\right) = O(sub[|sub| - 1])$.

Since $|inLink| \le n$, the loop of 24-27 is executed over the time $O(|inLink|) = O(n)$. Thus, considering that $k \le n$, we obtain an asymptotic estimate of the running time of the algorithm: $O(n) + O(n + k + 1) + O(k) = O(n)$.

The theorem is proved.
Example 2.
Suppose, as in Example 1, $P = \{one, on, once, cell, lull, eye, near\}$. Then

$\tilde{s} = \alpha_8\, raen\, \alpha_7\, eye\, \alpha_6\, llul\, \alpha_5\, llec\, \alpha_4\, ecno\, \alpha_3\, no\, \alpha_2\, eno\, \alpha_1$. (2)

Table 2 shows the result of the failure function algorithm for the input string $\tilde{s}$ (2).

Table 2
False links between nodes

| Index | inLink array (mirrored string) | link array (0. node state; 1. state reached by the false link) |
|---|---|---|
| 0 | ε | 0. nea; 1. ε |
| 1 | ε | 0. c; 1. ε |
| 2 | c | 0. onc; 1. c |
| 3 | ε | 0. e; 1. ε |
| 4 | e | 0. ce; 1. e |
| 5 | ec | 0. once; 1. ce |
| 6 | e | 0. ne; 1. e |
| 7 | en | 0. one; 1. ne |
| 8 | e | 0. eye; 1. e |
| 9 | ε | 0. l; 1. ε |
| 10 | l | 0. cel; 1. l |
| 11 | l | 0. cell; 1. l |
| 12 | l | 0. lull; 1. l |
| 13 | l | 0. lul; 1. l |
| 14 | ε | 0. n; 1. ε |
| 15 | n | 0. on; 1. n |
| 16 | ε | 0. o; 1. ε |
| 17 | ε | 0. near; 1. ε |
| 18 | ε | 0. lu; 1. ε |
| 19 | ε | 0. ey; 1. ε |
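The false links of Example 2 can be cross-checked with a short Python sketch that applies the definition of the failure function directly: for every non-root state, take the largest proper suffix that is itself a state of the prefix tree. This brute-force check is an assumption-laden illustration and is not the linear-time FalseSuff procedure; the function name `failure_links` is hypothetical.

```python
def failure_links(patterns):
    """Map every non-empty prefix-tree state to the largest proper suffix
    of that state which is also a state (the empty string denotes the root)."""
    # Every prefix of every pattern is a state of the prefix tree.
    states = {p[:i] for p in patterns for i in range(len(p) + 1)}
    links = {}
    for state in states:
        if not state:
            continue                      # the root has no false link
        for cut in range(1, len(state) + 1):
            suffix = state[cut:]          # proper suffixes, longest first
            if suffix in states:          # the empty suffix always qualifies
                links[state] = suffix
                break
    return links


if __name__ == "__main__":
    patterns = ["one", "on", "once", "cell", "lull", "eye", "near"]
    for state, target in sorted(failure_links(patterns).items()):
        print(state, "->", target or "root")
```

Mirroring the strings, as the paper does, turns this suffix question into a prefix question on the reversed states, which is what allows the suffix array of $\tilde{s}$ to be used.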














For all nodes for which Fig. 1 does not show false links, we assume that the false link leads to the root node.

Fig. 1. Prefix tree with false links


Discussion and Conclusions. A new preprocessing procedure for the Aho-Corasick algorithm is described. It runs in linear time $O(n)$. The connection between suffix arrays and the prefix tree was investigated, which allowed us to propose a different way of constructing the transition and failure functions. The results obtained reduce the time spent on preprocessing a set of pattern strings when an integer alphabet is used.